Big data! If you don’t have it, you better get yourself some. Your competition has it, after all. Bottom line: If your data is little, your rivals are going to kick sand in your face and steal your girlfriend.
There are many problems with the assumptions behind the “big data” narrative (above, in a reductive form) being pushed, primarily, by consultants and IT firms that want to sell businesses the next big thing. Fortunately, honest practitioners of big data (aka data scientists) are by nature highly skeptical, and they’ve provided us with a litany of reasons to be wary of many of the claims made for this field. Here they are:
Even web giants like Facebook and Yahoo generally aren’t dealing with big data, and the application of Google-style tools is inappropriate.
Facebook and Yahoo run their own giant, in-house “clusters”—collections of powerful servers—for crunching data. The necessity of these clusters is one of the hallmarks of big data. After all, data isn’t all that “big” if you could chew through it on your PC at home. The necessity of breaking problems into many small parts, and processing each on a large array of computers, characterizes classic big data problems like Google’s need to compute the rank of every single web page on the planet.
But it appears that for both Facebook and Yahoo, those same clusters are unnecessary for many of the tasks they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range, which means they could easily be handled on a single computer, even a laptop.
The story is similar at Yahoo, where it appears the median task size handed to the company’s cluster is 12.5 gigabytes. That’s bigger than what the average desktop PC could handle, but it’s no problem for a single powerful server.
All of this is outlined in a paper from Microsoft Research, aptly titled “Nobody ever got fired for buying a cluster,” which points out that a lot of the problems solved by engineers at even the most data-hungry firms don’t need to be run on clusters. And why is that an issue? Because there are vast classes of problems for which clusters are a relatively inefficient—or even totally inappropriate—solution.
Big data has become a synonym for “data analysis,” which is confusing and counter-productive.
Analyzing data is as old as tabulating a record of all the Pharaoh’s bags in the royal granary, but now that you can’t say data without putting “big” in front of it, the very necessary practice of data analysis has been swept up in a larger and less helpful fad. One post, for example, exhorts readers to “Incorporate Big Data Into Your Small Business” but is about a quantity of data that probably wouldn’t strain Google Docs, much less Excel on a single laptop.
Which is to say, most businesses are in fact dealing with what Rufus Pollock, of the Open Knowledge Foundation, calls small data. It’s very important stuff—a “revolution,” according to Pollock. But it has little connection to the big kind.
Supersizing your data is going to cost you and may yield very little.
Is more data always better? Hardly. In fact, if you’re looking for correlations—is thing X connected to thing Y, in a way that will give me information I can act on?—gathering more data could actually hurt you.
“The information you can extract from any big data asymptotically diminishes as your data volume increases,” wrote Michael Wu, the “principal scientist of data analytics” at social media analysis firm Lithium. For those of you who don’t normally think in data, what that means is that past a certain point, your return on adding more data diminishes to the point that you’re only wasting time gathering more.
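To put rough numbers on that diminishing return, here is a minimal sketch. It is not Wu’s own analysis. It assumes the thing you are estimating is a simple average, and that its uncertainty follows the textbook standard error, sigma divided by the square root of the sample size. The function names and the choice of sigma = 1 are illustrative assumptions.

```python
import math


def standard_error(sigma, n):
    """Textbook standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def gain_from_more_data(sigma, n, extra):
    """How much the standard error shrinks when `extra` rows are added to `n`."""
    return standard_error(sigma, n) - standard_error(sigma, n + extra)


if __name__ == "__main__":
    sigma = 1.0   # assumed spread of the underlying measurements
    extra = 1000  # the same batch of new rows added at every starting size
    for n in (100, 10_000, 1_000_000):
        gain = gain_from_more_data(sigma, n, extra)
        print(f"start with {n:>9,} rows, add {extra:,}: error shrinks by {gain:.6f}")
```

Under these assumptions, the same 1,000 extra rows buy far less precision when added to a million rows than when added to a hundred, which is the “past a certain point” effect in miniature.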
One reason: The “bigger” your data, the more false positives will turn up when you’re looking for correlations in it. As data scientist Vincent Granville wrote in “The curse of big data,” it’s not hard, even with a data set that includes just 1,000 items, to get into a situation in which “we are dealing with many, many millions of correlations.” And that means, “out of all these correlations, a few will be extremely high just by chance: if you use such a correlation for predictive modeling, you will lose.”
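The effect is easy to reproduce on a laptop. The sketch below is an illustration, not Granville’s calculation. It fills a table with pure random noise, so that no column is genuinely related to any other, then reports the strongest correlation it can find between any two columns. The function names, the 20 observations per column and the column counts are all assumptions chosen to keep the run short.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(num_variables, num_observations, seed=0):
    """Return (number of pairs, largest |r|) among columns of pure noise."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: any correlation is chance.
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(num_observations)]
        for _ in range(num_variables)
    ]
    pairs = list(itertools.combinations(columns, 2))
    strongest = max(abs(pearson(a, b)) for a, b in pairs)
    return len(pairs), strongest


if __name__ == "__main__":
    for num_variables in (10, 50, 200):
        count, strongest = strongest_chance_correlation(num_variables, 20)
        print(f"{num_variables:4d} variables, {count:6d} pairs, "
              f"strongest |r| = {strongest:.2f}")
```

The number of pairs grows roughly with the square of the number of variables, so each added column gives chance many more opportunities to produce an impressive-looking match. Because the seed is fixed, the smaller tables here are subsets of the larger ones, and the strongest spurious correlation can only stay the same or grow as columns are added.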
This problem crops up all the time in one of the original applications of big data—genetics. The endless “fishing expeditions” conducted by scientists who are content to sequence whole genomes and go diving into them looking for correlations can turn up all sorts of unhelpful results.
In some cases, big data is as likely to confuse as it is to enlighten.
When companies start using big data, they are wading into the deep end of a number of tough disciplines: statistics, data quality, and everything else that makes up “data science.” Just as in the kind of science that is published every day (and, just as often, ignored, revised, or never verified), the pitfalls are many.
Biases in how data are collected, a lack of context, gaps in what’s gathered, artifacts of how data are processed, and the cognitive biases that lead even the best researchers to see patterns where there are none all mean that “we may be getting drawn into particular kinds of algorithmic illusions,” said MIT Media Lab visiting scholar Kate Crawford. In other words, even if you have big data, it’s not something that Joe in the IT department can tackle; it may require someone with a PhD, or the equivalent amount of experience. And when they’re done, their answer to your problem might be that you don’t need “big data” at all.
So what’s better—big data or small?
Does your business need data? Of course. But buying into something as faddish as the supposed importance of the size of one’s data is the kind of thing only pointy-haired Dilbert bosses would do. The same issues that have plagued science since its inception—data quality, overall goals and the importance of context and intuition—are inherent in the way that businesses use data to make decisions. Remember: Gregor Mendel uncovered the secrets of genetic inheritance with just enough data to fill a notebook. The important thing is gathering the right data, not gathering some arbitrary quantity of it.